Abstract



Bob predicts a future observation based on a sample of size one. Alice can draw
a sample of any size before issuing her prediction. How much better can she do than
Bob? Perhaps surprisingly, under a large class of loss functions, which we refer to
as the Cover-Hart family, the best Alice can do is to halve Bob's risk. In this sense,
half the information in an infinite sample is contained in a sample of size one. The
Cover-Hart family is a convex cone that includes metrics and negative definite functions,
subject to slight regularity conditions. These results may help explain the small
relative differences in empirical performance measures in applied classification and
forecasting problems, as well as the success of reasoning and learning by analogy in
general, and nearest neighbor techniques in particular.

Key words: Bayes risk; decision theory; kernel score; loss function; metric; nearest 
1 Introduction

Alice and Bob compete in a game of prediction. The task is to predict a future observation, 
such as a class label, or a real-valued, vector-valued or highly structured outcome. Before
issuing a point forecast, Alice and Bob may sample from the underlying population. Bob
has access to a sample of size one only, whereas Alice can draw a sample of any desired
size. The predictive performance is evaluated by means of a loss function, $L(y, y') \ge 0$,
where $y$ is the point forecast, and $y'$ is the realizing value of the future observation, $Y'$.
Intuitively, we expect Alice to do much better than Bob, as she can gather essentially all
information available, thereby attaining or approximating the Bayes risk, namely

$$\alpha = \inf_y E_P L(y, Y'),$$

where $Y'$ has distribution $P$. However, Cover and Hart (1967) and Cover (1968) showed
that under misclassification loss and squared error, Bob's risk, $\beta$, is at most twice Alice's
risk, $\alpha$, that is,

$$\alpha \le \beta = E_P L(Y, Y') \le 2\alpha, \tag{1}$$

where $Y$ and $Y'$ are independent with distribution $P$.
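For a finite set of class labels, both risks in the Cover-Hart inequality are available in closed form under misclassification loss, which makes the inequality easy to inspect. The following is a minimal sketch, not part of the original argument; the function name `misclassification_risks` and the example distribution are illustrative assumptions.

```python
def misclassification_risks(probs):
    """Return (alpha, beta) under 0-1 loss for a finite class distribution.

    `probs` holds the class probabilities of P (hypothetical input format).
    """
    if abs(sum(probs) - 1.0) > 1e-12:
        raise ValueError("probabilities must sum to one")
    # Alice knows P and predicts the most likely class (Bayes risk).
    alpha = 1.0 - max(probs)
    # Bob predicts an independent draw Y and errs whenever Y != Y'.
    beta = 1.0 - sum(p * p for p in probs)
    return alpha, beta


probs = [0.5, 0.3, 0.2]
alpha, beta = misclassification_risks(probs)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}, 2 * alpha = {2 * alpha:.2f}")
# prints: alpha = 0.50, beta = 0.62, 2 * alpha = 1.00
```

For a uniform distribution the two risks coincide, so the lower bound in the inequality can be attained.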

In an elegant and thought-provoking discussion, Cover (1977) noted that the inequality
continues to hold if the loss function is a metric. In this paper, we seek a unifying treatment
of these remarkable facts. Section 2 identifies large classes of loss functions that satisfy
the Cover-Hart inequality (1), including both metrics and negative definite functions.
Section 3 considers probabilistic predictions, where the forecasts take the form of predictive
probability distributions, and the predictive performance is evaluated by means of a proper
scoring rule, $S(Q, y')$, where $Q$ is the predictive probability distribution and $y'$ is the
realizing observation (Gneiting and Raftery 2007). Under the class of kernel scores, which
includes the Brier score and the continuous ranked probability score, an analogue of the
Cover-Hart inequality applies, in that

$$\alpha = E_P S(P, Y') \le \beta \equiv E_P S(\delta_Y, Y') \le 2\alpha, \tag{2}$$

where, again, $Y$ and $Y'$ are independent with distribution $P$, and $\delta_Y$ is the point or Dirac
random probability measure in $Y$. The paper closes with a discussion in Section 4, where
we relate to the empirical success of reasoning and learning by analogy in general, and of
nearest neighbor techniques in particular.

2 Point predictions based on a single observation

We now discuss the generality of the Cover-Hart inequality. Toward this end, we let $\mathcal{P}$ be
the family of the Radon probability measures on a Hausdorff space $(\Omega, \mathcal{B})$, where $\mathcal{B}$ is the
Borel $\sigma$-algebra. We say that a function $L : \Omega \times \Omega \to \mathbb{R}$ is measurable if it is measurable
with respect to either argument when the other argument is fixed.

Definition 2.1. The Cover-Hart class consists of the measurable functions $L : \Omega \times \Omega \to [0, \infty)$
which are such that $L(y, y) = 0$ for all $y \in \Omega$, and

$$\alpha \equiv \inf_y E_P L(y, Y') \le \beta = E_P L(Y, Y') \le 2\alpha \tag{3}$$

for all probability measures $P \in \mathcal{P}$, where $Y$ and $Y'$ are independent with distribution $P$.

Under a loss function in the Cover-Hart class, half the information in an infinite sample
is contained in a sample of size one, in the sense that predicting a future observation from
a single past observation incurs at most twice the Bayes risk.

Theorem 2.2. The Cover-Hart class is a convex cone.

Proof. Suppose that $L_1$ and $L_2$ belong to the Cover-Hart class and let $c_1, c_2 > 0$. Then the
conic combination $L = c_1 L_1 + c_2 L_2$ is measurable, $L(y, y) = 0$ for all $y \in \Omega$, and

$$\begin{aligned}
E_P L(Y, Y') &= c_1 E_P L_1(Y, Y') + c_2 E_P L_2(Y, Y') \\
&\le 2 c_1 \inf_{y_1} E_P L_1(y_1, Y') + 2 c_2 \inf_{y_2} E_P L_2(y_2, Y') \\
&\le 2 \inf_y \left( c_1 E_P L_1(y, Y') + c_2 E_P L_2(y, Y') \right) \\
&\le 2 \inf_y E_P L(y, Y')
\end{aligned}$$

for every $P \in \mathcal{P}$, whence $L$ belongs to the Cover-Hart class. □

The following result is based on a slight extension of an argument of Cover (1977), who
implicitly assumed the existence of a Bayes rule.

Theorem 2.3. Any measurable metric belongs to the Cover-Hart class.

Proof. If $L$ is a measurable metric, then $L$ is nonnegative with $L(y, y) = 0$ and
$L(y, y'') \le L(y, y') + L(y', y'')$ for all $y, y', y'' \in \Omega$. Given any $P \in \mathcal{P}$ there exists a
sequence $(y_n)$ in $\Omega$ such that

$$\alpha = \inf_y E_P L(y, Y') = \lim_{n \to \infty} E_P L(y_n, Y').$$

Thus,

$$\alpha \le \beta = E_P L(Y, Y') \le E_P \left( L(Y, y_n) + L(y_n, Y') \right) = 2 E_P L(y_n, Y')$$

for all integers $n = 1, 2, \ldots$ The Cover-Hart inequality (3) emerges in the limit as $n \to \infty$,
as desired. □
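The triangle inequality argument can be illustrated on a finite metric space, where the infimum over point forecasts reduces to a minimum and both risks can be computed exactly. The sketch below assumes the Hamming distance on binary strings of length three and an arbitrary, randomly generated distribution; the names `hamming` and `metric_risks` are hypothetical and not taken from the original.

```python
import itertools
import random


def hamming(u, v):
    """Hamming distance between two equal-length tuples (a metric)."""
    return sum(a != b for a, b in zip(u, v))


def metric_risks(points, probs, dist):
    """Return (alpha, beta) for a distribution on a finite metric space."""
    # Alice: best single point forecast, minimizing the expected loss.
    alpha = min(
        sum(p * dist(y, z) for z, p in zip(points, probs)) for y in points
    )
    # Bob: forecast is an independent draw Y from the same distribution.
    beta = sum(
        p * q * dist(y, z)
        for y, p in zip(points, probs)
        for z, q in zip(points, probs)
    )
    return alpha, beta


rng = random.Random(0)  # fixed seed so that the example is reproducible
points = list(itertools.product((0, 1), repeat=3))
weights = [rng.random() for _ in points]
probs = [w / sum(weights) for w in weights]

alpha, beta = metric_risks(points, probs, hamming)
print(alpha <= beta <= 2 * alpha)  # the Cover-Hart inequality for this P
```

Any other distance function satisfying the metric axioms can be passed as `dist` in the same way.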

A function $L : \Omega \times \Omega \to [0, \infty)$ is said to be a negative definite kernel if it is symmetric
in its arguments, with $L(y, y) = 0$ for all $y \in \Omega$, and

$$\sum_{i=1}^n \sum_{j=1}^n a_i a_j L(y_i, y_j) \le 0$$

for all finite systems of points $y_1, \ldots, y_n \in \Omega$ and coefficients $a_1, \ldots, a_n \in \mathbb{R}$ such that
$a_1 + \cdots + a_n = 0$. Negative definite kernels play major roles in harmonic analysis (Berg,
Christensen and Ressel 1984) and in the theory of stochastic processes, where they arise
as the structure functions or variograms of random functions with stationary increments
(Gneiting, Sasvári and Schlather 2001). A wealth of examples of such functions can be
found in the monograph by Berg, Christensen and Ressel (1984) and the references therein.
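The defining condition can be probed numerically by evaluating the quadratic form for coefficient vectors that sum to zero. The following sketch is an illustration under stated assumptions rather than a proof: it uses the power kernel $|y - y'|^q$ on the real line, and the helper names `power_kernel`, `quadratic_form` and `max_quadratic_form` are hypothetical.

```python
import random


def power_kernel(q):
    """Return the kernel L(y, z) = |y - z|**q on the real line."""
    return lambda y, z: abs(y - z) ** q


def quadratic_form(points, coeffs, kernel):
    """Evaluate sum_i sum_j a_i a_j L(y_i, y_j)."""
    return sum(
        a * b * kernel(y, z)
        for y, a in zip(points, coeffs)
        for z, b in zip(points, coeffs)
    )


def max_quadratic_form(kernel, n_points=6, n_trials=200, seed=0):
    """Largest quadratic form found over random zero-sum coefficients."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(n_trials):
        points = [rng.uniform(-3.0, 3.0) for _ in range(n_points)]
        raw = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        mean = sum(raw) / n_points
        coeffs = [a - mean for a in raw]  # enforce a_1 + ... + a_n = 0
        worst = max(worst, quadratic_form(points, coeffs, kernel))
    return worst


# Points 0, 1, 2 with coefficients 1, -2, 1 (which sum to zero).
print(quadratic_form([0, 1, 2], [1, -2, 1], power_kernel(1)))  # -4
print(quadratic_form([0, 1, 2], [1, -2, 1], power_kernel(3)))  # 8
```

The second value is positive, so the exponent $q = 3$ violates the condition, whereas a random search such as `max_quadratic_form(power_kernel(1.0))` should not find a positive value beyond rounding error.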

Theorem 2.4. Any continuous negative definite kernel belongs to the Cover-Hart class.

Proof. If $E_P L(y, Y') = \infty$ for all $y$, then clearly the Cover-Hart inequality (3) holds. Thus,
we may assume that $\alpha = \inf_y E_P L(y, Y')$ is finite. By Theorem 2.1 in Berg, Christensen
and Ressel (1984, p. 235),

$$E_P L(Y, Y') + E_Q L(Z, Z') \le 2 E_{Q,P} L(Z, Y'),$$

where $P$ and $Q$ are Radon measures, $Y$ and $Y'$ have distribution $P$, $Z$ and $Z'$ have
distribution $Q$, and $Y, Y', Z, Z'$ are independent. When $Q$ is the point measure in $y \in \Omega$, the
term $E_Q L(Z, Z') = L(y, y)$ vanishes, and the above inequality implies that
$\beta = E_P L(Y, Y') \le 2 E_P L(y, Y')$ for all $y$, whence the Cover-Hart inequality is satisfied. □
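For finitely supported measures, the inequality used in the proof can be evaluated exactly. The sketch below is a numerical illustration only, assuming the power loss $|y - y'|^q$ on the real line and hypothetical helper names `cross_expectation` and `energy_gap`; it does not replace the cited theorem.

```python
def cross_expectation(support_a, probs_a, support_b, probs_b, loss):
    """E L(A, B) for independent A and B with finite supports."""
    return sum(
        p * q * loss(y, z)
        for y, p in zip(support_a, probs_a)
        for z, q in zip(support_b, probs_b)
    )


def energy_gap(support_p, probs_p, support_q, probs_q, loss):
    """Return 2 E L(Z, Y') - E_P L(Y, Y') - E_Q L(Z, Z').

    The cited inequality states that this gap is nonnegative when
    the loss is a continuous negative definite kernel.
    """
    within_p = cross_expectation(support_p, probs_p, support_p, probs_p, loss)
    within_q = cross_expectation(support_q, probs_q, support_q, probs_q, loss)
    between = cross_expectation(support_q, probs_q, support_p, probs_p, loss)
    return 2 * between - within_p - within_q


def squared_error(y, z):
    return (y - z) ** 2


# P is supported on {0, 1, 4}; Q is the point measure in y = 1.
gap = energy_gap([0, 1, 4], [0.2, 0.5, 0.3], [1], [1.0], squared_error)
print(round(gap, 6))  # 0.98, which equals 2 * E_P L(y, Y') - beta here
```

With $Q$ a point measure in $y$, a nonnegative gap is exactly the statement $\beta \le 2 E_P L(y, Y')$ used at the end of the proof.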

We now discuss a few special cases, which are summarized in Table 1. If $\Omega$ is a discrete
space, the misclassification loss, $L(y, y') = \mathbb{1}(y \ne y')$, is a continuous negative definite
kernel. Thus, Theorem 2.4 applies and reduces to a classical result. When $\Omega$ is finite, the
upper bound in the inequality can be strengthened (Cover and Hart 1967).

Table 1: Examples of negative definite kernels. Here, $\mathbb{Z}$ denotes the integers, $\mathbb{R}$ the real numbers,
and $\mathbb{S}^{d-1}$ the unit sphere in the Euclidean space $\mathbb{R}^d$, where $d \ge 2$. The symbol $\mathbb{1}(\cdot)$ stands for an
indicator function, $\|\cdot\|_p$ for the $\ell_p$-norm or quasi-norm in $\mathbb{R}^d$, and gcd for the geodetic or great
circle distance on $\mathbb{S}^{d-1}$.

Space                 Kernel                                 Parameters
$\mathbb{Z}$          $L(y, y') = \mathbb{1}(y \ne y')$
$\mathbb{R}$          $L(y, y') = |y - y'|^q$                $q \in (0, 2]$
$\mathbb{R}^2$        $L(y, y') = \|y - y'\|_p^q$            $p \in (2, \infty]$, $q \in (0, 1]$
$\mathbb{R}^d$        $L(y, y') = \|y - y'\|_p^q$            $p \in (0, 2]$, $q \in (0, p]$
$\mathbb{S}^{d-1}$    $L(y, y') = \mathrm{gcd}(y, y')$

If $\Omega$ is the real line, $\mathbb{R}$, the squared error loss function, $L(y, y') = (y - y')^2$, is a continuous
negative definite kernel. For a far-reaching generalization, let $\|\cdot\|_p$ denote the standard
$\ell_p$-norm or quasi-norm in the Euclidean space $\mathbb{R}^d$. Schoenberg's theorem (Schoenberg
1938; Berg, Christensen and Ressel 1984, p. 74) and a strand of literature culminating in
the work of Koldobskiĭ (1992) and Zastavnyi (1993) demonstrate that the kernel

$$L(y, y') = \|y - y'\|_p^q$$

is negative definite under the conditions stated in Table 1, but not otherwise. Theorem 2.4
applies and the respective loss function is a member of the Cover-Hart class. To give an
explicit example, if $d = 1$ and the probability measure $P$ is Gaussian, then $\alpha = 2^{-q/2} \beta$.
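The Gaussian example can be checked by a short calculation. The following derivation is a sketch under stated assumptions: $P$ is Gaussian with mean $\mu$ and variance $\sigma^2$, $W$ denotes a standard normal variable (a symbol introduced here for illustration), and we take for granted that $y = \mu$ minimizes $E_P |y - Y'|^q$, which is plausible by the symmetry and unimodality of the Gaussian density.

```latex
\begin{align*}
\alpha &= \inf_y E_P |y - Y'|^q = E_P |\mu - Y'|^q = \sigma^q \, E|W|^q, \\
\beta  &= E_P |Y - Y'|^q = (\sqrt{2}\,\sigma)^q \, E|W|^q = 2^{q/2} \alpha,
\end{align*}
```

Here the second line uses that $Y - Y'$ is Gaussian with mean zero and variance $2\sigma^2$. Under squared error, $q = 2$, Bob's risk is exactly twice the Bayes risk, and the ratio $\beta / \alpha = 2^{q/2}$ decreases toward one as $q$ decreases to zero.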

Negative definite kernels can readily be constructed from positive definite functions
(Schoenberg 1938; Gneiting, Sasvári and Schlather 2001). In this light, graph kernels
(Borgwardt et al. 2005; Vishwanathan et al. 2010) and related types of positive definite
functions on discrete structured spaces, as reviewed by Hofmann, Schölkopf and Smola
(2008), yield Cover-Hart loss functions that are relevant to the prediction of highly
structured objects, such as strings, trees, graphs and patterns.

3 Probabilistic predictions based on a single observation 

Thus far, we have studied single-valued point forecasts. In this section, we turn to
probabilistic predictions, where the forecasts take the form of predictive probability distributions
over future quantities and events (Gneiting 2008). Technically, we retain the above setting
and let $\mathcal{P}$ denote the class of the Radon probability measures on a Hausdorff space $(\Omega, \mathcal{B})$.
Predictive performance is evaluated by means of a score,

$$S(Q, y'),$$

that quantifies the loss when the probabilistic forecast is the Radon probability measure
$Q \in \mathcal{P}$, and the realizing observation is $y' \in \Omega$.
A scoring rule thus is a function $S : \mathcal{P} \times \Omega \to \mathbb{R}$. It is called proper if

$$E_P S(P, Y') \le E_P S(Q, Y')$$

for all probability measures $P, Q \in \mathcal{P}$, where $Y'$ has distribution $P$ and the expectations
are assumed to exist. In other words, proper scoring rules encourage careful and honest
probabilistic predictions and prevent hedging.

By Theorem 4 of Gneiting and Raftery (2007), proper scoring rules can be constructed
from negative definite kernels, as follows.¹ Let $L$ be a nonnegative, continuous negative
definite kernel. Then the scoring rule

$$S(P, y') = E_P L(Y, y') - \frac{1}{2} E_P L(Y, Y')$$

is proper relative to the class of the Radon probability measures $P$ for which the
expectation $E_P L(Y, Y')$ is finite, where $Y$ and $Y'$ are independent with distribution $P$. Scoring
rules of this form are referred to as kernel scores, and several of the most popular and
most frequently used examples belong to this class, including both the Brier score and the
continuous ranked probability score.

Under a kernel score, a natural analogue of the Cover-Hart inequality applies to
probabilistic predictions. Specifically, if we define $\alpha = \inf_Q E_P S(Q, Y')$ and $S$ is a
kernel score, a straightforward calculation shows that

$$\alpha = E_P S(P, Y') \le \beta = E_P S(\delta_Y, Y') = 2\alpha, \tag{4}$$

where $Y$ and $Y'$ are independent with distribution $P$, and $\delta_Y$ is the point or Dirac random
probability measure in $Y$. Again, half the information in an infinite sample is contained in
a sample of size one, in that probabilistically predicting a future observation from a single
past observation incurs at most twice the Bayes risk.
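The calculation behind the probabilistic analogue can be reproduced exactly for finitely supported distributions. The sketch below assumes the absolute error kernel $L(y, y') = |y - y'|$ and an arbitrary three-point distribution; the function names `kernel_score`, `expected_score` and `alice_and_bob` are hypothetical.

```python
def kernel_score(support, probs, outcome, loss):
    """S(P, y') = E_P L(Y, y') - 0.5 * E_P L(Y, Y') for a finite P."""
    first = sum(p * loss(y, outcome) for y, p in zip(support, probs))
    second = sum(
        p * q * loss(y, z)
        for y, p in zip(support, probs)
        for z, q in zip(support, probs)
    )
    return first - 0.5 * second


def expected_score(f_support, f_probs, t_support, t_probs, loss):
    """Expected score of a forecast when Y' follows the true distribution."""
    return sum(
        q * kernel_score(f_support, f_probs, y_prime, loss)
        for y_prime, q in zip(t_support, t_probs)
    )


def alice_and_bob(support, probs, loss):
    """Return (alpha, beta): Alice issues P, Bob the Dirac measure in Y."""
    alpha = expected_score(support, probs, support, probs, loss)
    beta = sum(
        p * expected_score([y], [1.0], support, probs, loss)
        for y, p in zip(support, probs)
    )
    return alpha, beta


def absolute_error(y, z):
    return abs(y - z)


alpha, beta = alice_and_bob([0, 1, 4], [0.2, 0.5, 0.3], absolute_error)
print(round(alpha, 6), round(beta, 6))  # 0.79 1.58
```

In this example Bob's expected score is exactly twice Alice's, and replacing Alice's forecast by any other finitely supported $Q$ in `expected_score` should not lower her expected score, in line with propriety.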

4 Discussion 

Despite being well known in pattern analysis and information theory (see, for example,
Devroye, Györfi and Lugosi 1996), the groundbreaking work of Cover and Hart (1967)
and Cover (1968) has hardly received any attention in the statistical literature.

In this paper, we have demonstrated that the Cover-Hart inequality (1) applies whenever
the loss function is a measurable metric, or a continuous negative definite kernel. Many but
not all metrics are negative definite (Meckes 2011), and so the two families may have
distinct members. An interesting open question is whether or not the Cover-Hart class
equals the convex cone that is generated by these two families. In particular, we do not know
whether or not the Cover-Hart class contains any asymmetric loss functions.

¹ As pointed out to the author by Jochen Fiedler, Theorem 4 of Gneiting and Raftery (2007) ought to be
formulated relative to the class of the Radon probability measures on $\Omega$, as opposed to the larger class of
the Borel probability measures. While we are unaware of a counterexample for Borel measures, the result of
Berg, Christensen and Ressel (1984) used in the proof of the theorem applies to Radon measures only.

Typically, predictions are conditional on an information set, leading to natural
ramifications of single nearest neighbor methods, such as nonparametric regression (Stone 1977)
and kernel estimators of conditional predictive distributions (Hyndman, Bashtannyk and
Grunwald 1996; Hall, Wolff and Yao 1999). While we have suppressed the dependence on
the information set in our work, the Cover-Hart inequality remains valid in this setting, by
conditioning on and integrating over the information set.

In this light, if empirically observed mean score differentials exceed 100%, this may 
suggest that forecasters have distinct information sets. A simulation example is reported
in Tables 4 and 6 of Gneiting (2011), where the differences in the predictive scores
between Mr. Bayes and his competitors, whose predictions are based on thoroughly distinct 
information sets, are striking. 

From an applied perspective, the Cover-Hart inequality (1) for point forecasts, and its
analogue (2) for probabilistic forecasts, allow for interesting interpretations. Under many
of the most prevalent loss functions used in practice, Alice, despite having an infinite
sample at her disposal, can at most halve the risk of Bob, who has access to a sample
of size one only. It is thus not surprising that empirically observed differentials in the predictive
performance of competing forecasters tend to be small. For example, this was observed in
the Netflix contest, where predictive performance was measured in terms of the (root mean)
squared error (Feuerverger, He and Khatri 2012). Taking a much broader perspective, the
Cover-Hart inequality may contribute to our understanding of the empirical success not
only of nearest neighbor techniques and their ramifications, but also of reasoning and learning by
analogy in general (Gentner and Holyoak 1997).

Acknowledgements 

The author is grateful to the Alfried Krupp von Bohlen und Halbach Foundation for support, and
thanks Jochen Fiedler, Marc Genton, Christoph Schnörr and Jon Wellner for discussions
and references.

References 

Berg, C., Christensen, J. P. R. and Ressel, P. (1984). Harmonic Analysis on Semigroups.
New York: Springer.

Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S. V. N., Smola, A. J. and
Kriegel, H.-P. (2005). Protein function prediction via graph kernels. Bioinformatics,
21, i47-i56.

Cover, T. M. (1968). Estimation by the nearest neighbor rule. IEEE Transactions on
Information Theory, 14, 50-55.

Cover, T. M. (1977). Comment. Annals of Statistics, 5, 627-628.

Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE
Transactions on Information Theory, 13, 21-27.

Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern
Recognition. New York: Springer.

Feuerverger, A., He, Y. and Khatri, S. (2012). Statistical significance of the Netflix
challenge. Statistical Science, in press.

Gentner, D. and Holyoak, K. J. (1997). Reasoning and learning by analogy: Introduction.
American Psychologist, 52, 32-34.

Gneiting, T. (2008). Editorial: Probabilistic forecasting. Journal of the Royal Statistical
Society Series A: Statistics in Society, 171, 319-321.

Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American
Statistical Association, 106, 746-762.

Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and
estimation. Journal of the American Statistical Association, 102, 359-378.

Gneiting, T., Sasvári, Z. and Schlather, M. (2001). Analogies and correspondences between
variograms and covariance functions. Advances in Applied Probability, 33, 617-630.

Hall, P., Wolff, R. C. L. and Yao, Q. (1999). Methods for estimating a conditional
distribution function. Journal of the American Statistical Association, 94, 154-163.

Hofmann, T., Schölkopf, B. and Smola, A. (2008). Kernel methods in machine learning.
Annals of Statistics, 36, 1171-1220.

Hyndman, R. J., Bashtannyk, D. M. and Grunwald, G. K. (1996). Estimating and
visualizing conditional densities. Journal of Computational and Graphical Statistics, 5,
315-336.

Koldobskiĭ, A. L. (1992). Schoenberg's problem on positive definite functions. St.
Petersburg Mathematical Journal, 3, 563-570.

Meckes, M. W. (2011). Positive definite metric spaces. Preprint, arXiv:1012.5863v3.

Schoenberg, I. J. (1938). Metric spaces and positive definite functions. Transactions of the
American Mathematical Society, 44, 522-536.

Stone, C. J. (1977). Consistent nonparametric regression (with discussion). Annals of
Statistics, 5, 595-645.

Vishwanathan, S. V. N., Schraudolph, N., Kondor, R. and Borgwardt, K. M. (2010). Graph
kernels. Journal of Machine Learning Research, 11, 1201-1242.

Zastavnyi, V. P. (1993). Positive definite functions depending on the norm. Russian Journal
of Mathematical Physics, 1, 511-522.



